Immunotherapy datasets analyses
Immunotherapy datasets analyses
This part will clearly describe how to analyze APS, TMB and TIGS in immunotherapy datasets, including cancer type level and individual level. Raw data used for these analyses have been preprocessed, detail please read preprocessing part or TCGA pan-cancer analyses of this analysis report.
TIGS and pan-cancer objective response rate to PD-1 inhibition
Previous study has shown that TMB could predict pan-cancer ICI objective response rates (ORR). Here we will evaluate and compare the predictive power of APS, TIGS with TMB in pan-cancer ICI objective response rates prediction. The ORR for anti–PD-1 or anti–PD-L1 therapy will be plotted against the corresponding median APS, TIGS, TMB across multiple cancer types.
Through an extensive literature search, we identified 26 tumor types or subtypes for which data regarding the ORR are available. For each tumor type, we pooled the response data from the largest published studies that evaluated the ORR. We included only studies of anti–PD-1 or anti–PD-L1 monotherapy that enrolled at least 10 patients who were not selected for PD-L1 tumor expression (Identified individual studies and references are available in the manuscript Supplementary Table 3). The median tumor mutational burden for each tumor type was obtained from a validated comprehensive genomic profiling assay performed and provided by Foundation Medicine. The APS information for 23 tumor types were calculated based on TCGA datasets, and the APS for merkel cell carcinoma, cutaneous squamous cell carcinoma and small-cell lung cancer were calculated based on GEO microarray datasets.
Combination of datasets
In code implementation, we firstly combine APM score, TIGS score from TCGA studies and GEO datasets.
library(tidyverse)
#> ─ Attaching packages ────────────────────────────────────── tidyverse 1.2.1 ─
#> ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.8
#> ✔ tidyr 0.8.2 ✔ stringr 1.3.1
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ─ Conflicts ──────────────────────────────────────── tidyverse_conflicts() ─
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
## retrieve APM score
load("results/Add_gsva_scc_sclc_merkel.RData")
buildDF = function(x, tumor_type = NULL){
stopifnot(!is.null(tumor_type))
APM = x$APM
res = data.frame(Project = tumor_type, APM = APM, stringsAsFactors = FALSE)
res
}
lapply(gsva.merkel, FUN = buildDF, tumor_type = "Merkel Cell Carcinoma") %>% purrr::reduce(rbind) -> df_merkel
lapply(gsva.scc, FUN = buildDF, tumor_type = "Cutaneous Squamous Cell Carcinoma") %>% purrr::reduce(rbind) -> df_scc
lapply(gsva.sclc, FUN = buildDF, tumor_type = "Small-Cell Lung Cancer") %>% purrr::reduce(rbind) -> df_sclc## combine datasets
load("results/TCGA_ALL.RData")
df_list = list(tcga_all %>% filter(!is.na(APM)), df_merkel, df_scc, df_sclc)
purrr::reduce(df_list, bind_rows) -> df_all
if (!file.exists("results/df_all.RData")) {
save(df_all, file = "results/df_all.RData")
}Using the combined dataset df_all, we can normalize APM score to [0, 1] region, then we calculate median APM score for each tumor type.
df_all %>%
mutate(N.APM = (APM - min(APM)) / (max(APM) - min(APM))) %>%
group_by(Project) %>%
summarise(MedianAPM = median(N.APM), NumberOfPatient = n()) -> sm_APM
sm_APM
#> # A tibble: 35 x 3
#> Project MedianAPM NumberOfPatient
#> <chr> <dbl> <int>
#> 1 ACC 0.444 79
#> 2 BLCA 0.588 407
#> 3 BRCA 0.323 1083
#> 4 CESC 0.734 303
#> 5 CHOL 0.623 36
#> 6 COAD 0.646 286
#> 7 Cutaneous Squamous Cell Carcinoma 0.506 4
#> 8 DLBC 0.816 48
#> 9 ESCA 0.353 184
#> 10 GBM 0.270 154
#> # ... with 25 more rowsLastly, we merge this data to 26 tumor types or subtypes for which data regarding the ORR, TMB are available by hand using Excel. This process has been double checked. With the merged data, we can explore association of ORR and APM score, ORR and TMB etc. in immunotherapy datasets.
Significant correlation between APS, TMB and the ORR
We plot ORR against APS, TMB respectively, and fit their relationship with linear model. We observe that significant correlation between APS, TMB and the ORR.
Load pan-cancer ORR data and show it as a table.
sm_data = read_csv("../data/summary_data_new_20180906.csv", col_types = cols())
DT::datatable(sm_data,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)Focus on cancer type with APS, fit data with linear model.
sm_data = sm_data %>% filter(!is.na(Pool_APM))
lm(Pool_ORR ~ Pool_APM, data = sm_data) -> fit1
lm(Pool_ORR ~ log(Pool_TMB), data = sm_data) -> fit2
summary(fit1)
#>
#> Call:
#> lm(formula = Pool_ORR ~ Pool_APM, data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.665 -9.829 -2.010 3.514 36.186
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -8.308 9.092 -0.914 0.36991
#> Pool_APM 50.418 16.481 3.059 0.00539 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13.51 on 24 degrees of freedom
#> Multiple R-squared: 0.2805, Adjusted R-squared: 0.2506
#> F-statistic: 9.358 on 1 and 24 DF, p-value: 0.00539
summary(fit2)
#>
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB), data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.089 -5.810 -3.520 2.417 39.262
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.200 5.218 -0.422 0.677029
#> log(Pool_TMB) 13.871 3.160 4.389 0.000196 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 11.87 on 24 degrees of freedom
#> Multiple R-squared: 0.4453, Adjusted R-squared: 0.4222
#> F-statistic: 19.26 on 1 and 24 DF, p-value: 0.0001964Of note, here we use
log(TMB)will not change the relationshipe between ORR and TMB.
We observe that 28% variance of ORR can be explained by APS, and 44% variance of ORR can be explained by TMB. The result of TMB is similar to NEJM paper Tumor Mutational Burden and Response Rate to PD-1 inhibition. This result show that TMB and APS are biomarker that predicting ORR of immunotherapy, TMB is better than APS.
Next we plot the linear model.
require(ggrepel)
#> Loading required package: ggrepel
require(scales)
#> Loading required package: scales
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
ggplot(sm_data, aes(x=Pool_APM, y=Pool_ORR)) +
geom_point(aes(color=Patients_APM, size=Patients_ORR)) +
geom_smooth(method="lm", se=T) +
geom_text_repel(aes(label=Cancer_Type), size=3) +
labs(x="Median Normalized APM Score ", y="Objective Response Rate (%)",
size="Objective Response Rate\n(no. of patients evaluated)",
color = "APM Score\n(no. of tumor analyzed)") +
scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
scale_color_gradientn(colours = RColorBrewer::brewer.pal(5, name="OrRd")[-1],
breaks = c(50, 200, 500, 1000)) +
theme_bw()
ggplot(sm_data, aes(x=Pool_TMB, y=Pool_ORR)) +
geom_point(aes(color=Patients_TMB, size=Patients_ORR)) +
geom_smooth(method="lm", se=T) +
geom_text_repel(aes(label=Cancer_Type), size=3) +
labs(x="Median No. of Coding Somatic Mutation per MB", y="Objective Response Rate (%)",
color = "Tumor Mutational Burden\n(no. of tumor analyzed)",
size="Objective Response Rate\n(no. of patients evaluated)") +
scale_x_continuous(trans = log_trans(),
breaks = c(2, 10, 20, 30, 40, 50),
labels = c(2, 10, 20, 30, 40, 50)) +
scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
scale_color_gradientn(colours = RColorBrewer::brewer.pal(5, name="OrRd")[-1],
breaks = c(100, 1000, 5000, 10000)) +
theme_bw()TIGS definition and performance
From the aspect of biological process, TMB release is independent from processing and presentation of tumor antigens. Now that we have found TMB and APM are biomarkers which can predict object reponse rate of cancer type in immunotherapy, these two factors are independent, and are both required for tumor immunogenicity determination, why not integrate them as a new biomarker which should be better than TMB?
Therefore, theoretically, tumor immunogenicity can be represented as
\[ [“Tumor \quad antigenicity”] \times [“Antigen \quad processing \quad and \quad presenting \quad status”]. \]
We use log(TMB) here because log(TMB) has been proved to have a linear relationship with ORR. So the formula can be written as
\[ TIGS = APS_{normalized} \times log(TMB) \]
However, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB. So in this case, the TIGS formula is
\[ TIGS = APS_{normalized} \times log(TMB + 1) \]
When TMB < 1 mutation/Mb, the
log(TMB)would less than 0, if APS is very big, we would get a big minus number in this case, which against common sense.
Now we construct a linear model by integrate TMB and APM.
lm(Pool_ORR ~ log(Pool_TMB +1):Pool_APM, data = sm_data) -> fit3
summary(fit3)
#>
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB + 1):Pool_APM, data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13.3507 -6.8209 -0.9202 2.3742 25.7436
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.601 4.050 -1.630 0.116
#> log(Pool_TMB + 1):Pool_APM 26.885 3.911 6.875 4.13e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9.245 on 24 degrees of freedom
#> Multiple R-squared: 0.6632, Adjusted R-squared: 0.6492
#> F-statistic: 47.27 on 1 and 24 DF, p-value: 4.134e-07We observe that 66% variance of ORR can be explained by TIGS, this result is indeed bettern than TMB.
sm_data$tigs = sm_data$Pool_APM * log(sm_data$Pool_TMB + 1)
ggplot(sm_data, aes(x=tigs, y=Pool_ORR)) +
geom_point(aes(size=Patients_ORR)) +
geom_smooth(method="lm", se=T) +
geom_text_repel(aes(label=Cancer_Type), size=3) +
labs(x="Tumor Immunogenicity Score ", y="Objective Response Rate (%)",
size="Objective Response Rate\n(no. of patients evaluated)") +
scale_size_continuous(breaks = c(50, 100, 500, 1000)) +
theme_bw() More exploration of TIGS formula
Would other combination of TMB and APS be better?
As mentioned above, we constructed the formula of TIGS according to our understanding of “tumor antigenicity”. Is there a better combination of these two factors?
There are three basic models we take into consideration.
- Model 1: \(ORR = d \times (a \times APS + b\times log(TMB)) + c\), i.e. \(TIGS = a \times APS + b \times log(TMB)\)
- Model 2: \(ORR = a \times (APS \times log(TMB)) + b\), i.e. \(TIGS = APS \times log(TMB)\) (This is the model we used above)
- Model 3: \(ORR = e \times (a \times APS + b\times log(TMB) + c\times APS \times log(TMB)) + d\), i.e. \(TIGS = a \times APS + b\times log(TMB) + c\times APS \times log(TMB)\)
a, b, c, d and e here are coefficients. In R, we don’t need to write them in models, it will take care of them.
Now, let’s observe the results.
# Model 1
lm(Pool_ORR ~ log(Pool_TMB)+Pool_APM, data = sm_data) %>% summary()
#>
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB) + Pool_APM, data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.628 -7.138 -1.461 3.109 29.314
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -20.490 7.375 -2.778 0.010686 *
#> log(Pool_TMB) 12.172 2.761 4.409 0.000203 ***
#> Pool_APM 39.417 12.643 3.118 0.004840 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10.16 on 23 degrees of freedom
#> Multiple R-squared: 0.6101, Adjusted R-squared: 0.5762
#> F-statistic: 17.99 on 2 and 23 DF, p-value: 1.979e-05
# Model 2
lm(Pool_ORR ~ log(Pool_TMB):Pool_APM, data = sm_data) %>% summary()
#>
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB):Pool_APM, data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.382 -6.262 -2.004 2.911 25.404
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.857 3.496 -0.531 0.6
#> log(Pool_TMB):Pool_APM 25.092 3.707 6.769 5.31e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9.34 on 24 degrees of freedom
#> Multiple R-squared: 0.6563, Adjusted R-squared: 0.642
#> F-statistic: 45.83 on 1 and 24 DF, p-value: 5.306e-07
# Model 3
lm(Pool_ORR ~ log(Pool_TMB)*Pool_APM, data = sm_data) %>% summary()
#>
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB) * Pool_APM, data = sm_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.0233 -6.8783 -0.4445 3.4839 25.8232
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.24 16.85 0.726 0.4753
#> log(Pool_TMB) -12.93 12.08 -1.071 0.2959
#> Pool_APM -23.41 31.80 -0.736 0.4693
#> log(Pool_TMB):Pool_APM 46.72 21.96 2.127 0.0449 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9.463 on 22 degrees of freedom
#> Multiple R-squared: 0.6766, Adjusted R-squared: 0.6325
#> F-statistic: 15.34 on 3 and 22 DF, p-value: 1.314e-05The model 1 is the worst one. Model 3 is a little better than model 2, however model 2 is much simpler (model 2 is included in model 3!). Following Occam’s razor, we choose model 2 when compare it with model 3. We also choose model 2 when compare it with model 1 because it is better and consist with the fact APM and TMB are relatively independent but not absolutely independent. To summary, model 2 is the best option.
Any promotion of formula \(TIGS = APS \times log(TMB + 1)\)?
We still have a question about TIGS formula. The TMB in formula is calculated as
\[ TMB = ln(\frac{whole \quad exome \quad mutation \quad number}{38} + 1 ) \]
We choose nonsynonymous mutations because they have functional effects. Only functional effects on protein will generate neo-antigen (peptide).
We use 38 here because 38 MB is the estimated length of exome, so we can normalize TMB to per MB. However, in math model, there may be a better value “a” than 38. We don’t know yet, so we need to explore this “a” value.
We simulate “a” value from 0.1 to 10000 with step size 0.1 in the following formula, the resulted TIGS is then used to fit a linear model for pan-cancer ICI ORR prediction. We calculate differnt R Square value under each “a” and see how it changes.
\[ TIGS = APS_{normalized} \times ln(\frac{whole \quad exome \quad mutation \quad number}{a} + 1) \]
Implementation:
df = sm_data
df$APS = df$Pool_APM
df$TMB = df$Pool_TMB * 38 # transform back to whole exome nonsynonymous mutation number
df$ORR = df$Pool_ORR
df = dplyr::select(df, APS, TMB, ORR)
calcTrends = function(df, a){
sapply(a, function(x, data){
summary(lm(ORR ~ APS : log((TMB/x + 1)), data = df) )$r.squared
}, data = df)
}Plot the simulation result.
opar = par(no.readonly = TRUE)
par(mfrow=c(1,2))
plot(res_df, las = 1, xlab = "a", ylab = "R Square", type = "l")
plot(res_df[1:1000,], las = 1, xlab = "a", ylab = "R Square", type = "l")
abline(v = 38, lty = 2, col = "red")We can find that R squares of linear model have a maximum value when “a” is around 40. Therefore, “a=38” is optimized to rescale the raw whole exome mutation counts and balance the contribution for APM and TMB in TIGS formula.
Performance comparison on immunotherapy clinical response prediction and survival benefit
Description of datasets please see preprocessing part or “Method Section” of manuscript.
TMB is a well-known biomarker and has been applied. Considering our TIGS is based on TMB, thus the most important point we wanna do in this section is comparing the prediction/clinical perfermance between TMB and TIGS.
However, some other biomarkers can predict immunotherapy response have been studied by other groups. Therefore, this section we will include following biomarkers in comparsion.
- APS - antigen processing and presenting efficiency score
- TMB - tumor mutation burden
- TIGS - tumor immunogenicity score
- IIS - immune infiltration score
- IFNG - interferon gamma response biomarkers of 6 genes including IFNG, STAT1, IDO1, CXCL10, CXCL9, and HLA-DRA
- CD8 - gene expression level of CD8A + CD8B
- PDL1 - an immunohistochemistry (IHC) biomarker approved by FDA. This study we use PD-L1 gene expression as the IHC surrogate
- TIDE - TIDE signature which integrates the expression signatures of T cell dysfunction and T cell exclusion to model tumor immune evasion.
The predicted values of gene expression biomarkers (i.e. IFNG, CD8, PDL1) are the average values among all members defined by the original publications.
Preprocessing of immunotherapy datasets
The first step is to preprocess 3 immunotherapy datasets with gene expression, TMB and clinical outcome available. APS from Pancan analysis is used to normalize APS here.
# get min and max APM value based on TCGA and GEO pan-cancer data
load("results/df_all.RData")
min_APS = min(df_all$APM)
max_APS = max(df_all$APM)
rm(df_all)
# load gene list
load("results/merged_geneList.RData")
# load function
source("../code/functions.R")Hugo et al 2016 dataset
## Data source: Genomic and Transcriptomic Features of Response to Anti-PD-1 Therapy in Metastatic Melanoma. Cell 2016 Mar 24;165(1):35-44. PMID: 26997480
suppressPackageStartupMessages(library(GEOquery))
library(readxl)
geo_dir = "../data/GEOdata"
gse = "GSE78220"
GSE_78220 = getGEO(gse, GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_78220 = GSE_78220$GSE78220_series_matrix.txt.gz
#### Download gene expression using following command
# download.file(url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE78220&format=file&file=GSE78220%5FPatientFPKM%2Exlsx", destfile = "../data/GSE78220_FPKM.xlsx")
GSE78220_FPKM = readxl::read_xlsx("../data/GSE78220_FPKM.xlsx")
GSE78220_FPKM = GSE78220_FPKM %>%
as.data.frame() %>%
mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
arrange(Gene, desc(mean_expr)) %>%
distinct(Gene, .keep_all = TRUE) %>%
dplyr::select(-mean_expr) %>%
as.tibble()
# pData(GSE_78220) %>% View()
# normalize gene expression as log(x + 1)
GSE78220_Norm = GSE78220_FPKM
GSE78220_Norm[,-1] = log2(GSE78220_Norm[,-1] + 1)
# apply gsva
applyGSVA(merged_geneList, group_col = "Cell_type",
gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva") -> gsva.hugo2016
gsva.hugo2016 = calc_TisIIs(gsva.hugo2016[[1]])
# normalize APS
APS_hugo = gsva.hugo2016$APM
APS_hugo = (APS_hugo - min_APS) / (max_APS - min_APS)
names(APS_hugo) = colnames(GSE78220_Norm)[-1]
APS_hugo
names(APS_hugo) = sub(pattern = "(Pt.*)\\.baseline", "\\1", names(APS_hugo))
# keep biggest value
APS_hugo = APS_hugo[-21]
names(APS_hugo)[c(14, 20)] = c("Pt16", "Pt27")
APS_hugo
# keep corresponding expression data for downstream analysis
rna_hugo = GSE78220_Norm[, -22]
colnames(rna_hugo)[-1] = names(APS_hugo)
save(rna_hugo, file = "results/Hugo2016_RNA.RData")
GSE78220.APS = tibble(PatientID = names(APS_hugo), APS=APS_hugo, IIS = gsva.hugo2016$IIS[-21])
# read clincal and tmb data
GSE_78220.TMB = read_csv("../data/Cell2016.csv", col_types = cols())
GSE_78220.TMB$nTMB = GSE_78220.TMB$TotalNonSyn / 38
info_hugo = full_join(GSE_78220.TMB, GSE78220.APS, by = "PatientID")
info_hugo = info_hugo[,c(1:5, 14:16, 13)]
# calculate TIGS score
info_hugo$TIGS = log(info_hugo$nTMB) * info_hugo$APS
save(info_hugo, file = "results/Hugo2016_Info.RData")Van Allen et al 2015 dataset
#----------------------------------------------------------
# Data source :
# Science 2015: Genomic correlates of response to CTLA4 blockade in metastatic melanoma
#--------------------------
library(readxl)
suppressPackageStartupMessages(library(tidyverse))
science2015_TMB = read_xlsx("../data/Science2015_TMB_List_AllPatients.xlsx")
science2015_cli1 = read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 1)
science2015_cli2 = read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 2)
science2015_rna = read_tsv("../data/MEL-IPI-Share.rpkm.gct", col_types = cols())
#---- Preprocess
vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
"Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
"In_Frame_Ins", "Missense_Mutation")
science2015_TMB %>% group_by(patient) %>%
summarise(TotalMutation=n(),
TotalNonSyn=sum(Variant_Classification %in% vc.nonSilent),
nTMB = TotalNonSyn / 38) -> science2015_TMB_summary
#---- remove duplicate symbols of rna data
science2015_rna %>% mutate(symbol = sub("(.*)\\..*$", "\\1", Description)) %>%
dplyr::select(-Name, -Description) %>%
dplyr::select(symbol, everything()) %>%
as.data.frame() %>%
mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
arrange(symbol, desc(mean_expr)) %>%
distinct(symbol, .keep_all = TRUE) %>%
dplyr::select(-mean_expr) %>%
as.tibble() -> science2015_ge
length(intersect(science2015_ge$symbol, GSE78220_Norm$Gene))
# the gene number of science paper is bigger than hugo 2016
# maybe miRNA and some other RNA included
# keep them almost equal\
science2015_ge = science2015_ge %>%
filter(symbol %in% GSE78220_Norm$Gene)
science2015_ge[, -1] = log2(science2015_ge[, -1] + 1)
applyGSVA(merged_geneList, group_col = "Cell_type",
gene_col = "Symbol",
ExprMatList = list(science2015_ge), method = "gsva") -> gsva.VanAllen2015
gsva.VanAllen2015 = calc_TisIIs(gsva.VanAllen2015[[1]])
APS_VanAllen = gsva.VanAllen2015$APM
names(APS_VanAllen) = colnames(science2015_ge)[-1]
APS_VanAllen = (APS_VanAllen - min_APS) / (max_APS - min_APS)
APS_VanAllen
names(APS_VanAllen) = sub("^MEL.IPI_(.*)\\.Tumor.*$", "\\1", names(APS_VanAllen))
APS_VanAllen
rna_VanAllen = science2015_ge
colnames(rna_VanAllen)[-1] = names(APS_VanAllen)
## save gene experssion
save(rna_VanAllen, file = "results/VanAllen2015_RNA.RData")
APS_VanAllen_df = tibble(patient = names(APS_VanAllen), APS = APS_VanAllen,
IIS = gsva.VanAllen2015$IIS)
science2015_cli = bind_rows(
dplyr::select(science2015_cli1, patient, age_start, RECIST, overall_survival, progression_free, primary, group,
histology, stage, gender, dead, progression, neos50),
dplyr::select(science2015_cli2, patient, age_start, RECIST, overall_survival, progression_free, primary,
group, histology, stage, gender, dead, progression) %>% filter(patient %in% c("Pat20", "Pat91"))
)
# combine three datasets
info_VanAllen = full_join(full_join(science2015_cli, APS_VanAllen_df, by = "patient"),
science2015_TMB_summary, by="patient")
info_VanAllen$TIGS = info_VanAllen$APS * log(info_VanAllen$nTMB +1) # to avoid TIGS <0, TIGS = log(TMB +1) *APM for this dataset
save(info_VanAllen, file = "results/VanAllen2015_Info.RData")Snyder et al 2017 dataset
#----------------------------------------------------------------
# Contribution of systemic and somatic factors to clinical response and resistance in urothelial cancer: an exploratory multi-omic analysis
#---------------------------------------------------------------
# Data Source: https://github.com/XSLiuLab/multi-omic-urothelial-anti-pdl1
# This is a fork repository, original link can also be found in this link
library(tidyverse)
uro_cli = read_csv("../data/data_clinical.csv", col_types = cols())
uro_counts = read_csv("../data/data_counts.csv", col_types = cols())
uro_variants = read_csv("../data/data_variants.csv",
col_types = "c??????")
uro_rna = read_csv("../data/data_kallisto.csv", col_types = cols())
uro_rna_wide = reshape2::dcast(uro_rna, gene_name ~ patient_id, value.var = "est_counts")
# try remove duplicated genes
uro_rna_wide %>%
as.data.frame() %>%
mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>%
arrange(gene_name, desc(mean_expr)) %>%
distinct(gene_name, .keep_all = TRUE) %>%
dplyr::select(-mean_expr) %>%
as.tibble() -> uro_rna_ge
length(intersect(uro_rna_ge$gene_name, GSE78220_FPKM$Gene))
uro_rna_ge = dplyr::filter(uro_rna_ge, gene_name %in% GSE78220_FPKM$Gene)
uro_rna_ge_Norm = uro_rna_ge
uro_rna_ge_Norm[, -1] = log2(uro_rna_ge_Norm[,-1]+1)
## apply GSVA
applyGSVA(merged_geneList, group_col = "Cell_type",
gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva") -> gsva.Snyder
gsva.Snyder = calc_TisIIs(gsva.Snyder[[1]])
APS_Snyder = gsva.Snyder$APM
APS_Snyder = (APS_Snyder - min_APS) / (max_APS - min_APS)
## save gene expression
rna_Snyder = uro_rna_ge_Norm
save(rna_Snyder, file = "results/Snyder2017_RNA.RData")
APS_Snyder_df = tibble(patient_id = colnames(uro_rna_ge_Norm)[-1],
APS = APS_Snyder,
IIS = gsva.Snyder$IIS)
uro_cli_2 = uro_cli %>%
dplyr::select(patient_id, Age, Sex, is_benefit, is_benefit_os, os, `Alive Status`) %>%
mutate(event = ifelse(`Alive Status` == 'Y', 0, 1))
uro_tmb = uro_variants %>%
group_by(patient_id) %>%
summarise(TMB = n(), Nonsyn = sum(!is.na(gene_name)), nTMB = TMB/38) %>%
mutate(patient_id = as.character(patient_id))
info_Snyder = dplyr::full_join(uro_cli_2, dplyr::full_join(uro_tmb, APS_Snyder_df))
info_Snyder$TIGS = info_Snyder$APS * log(info_Snyder$nTMB)
save(info_Snyder, file = "results/Snyder2017_Info.RData")
rm(list = ls())Gene expression biomarkers for response to immune checkpoint blockade
This section we will describe how to calculate predicted value of gene signature biomarker using gene expression data.
Load data firstly.
# library(tidyverse)
sigList = read_csv("../data/ICBbiomarkerSignature.csv", comment = "#", col_types = cols())
load("results/Hugo2016_RNA.RData")
load("results/VanAllen2015_RNA.RData")
load("results/Snyder2017_RNA.RData")Check if all gene signatures are in data.
all(sigList$Genes %in% rna_hugo$Gene)
#> [1] TRUE
all(sigList$Genes %in% rna_Snyder$gene_name)
#> [1] TRUE
all(sigList$Genes %in% rna_VanAllen$symbol)
#> [1] TRUEWe got all TRUE. Next we retrieve all gene expression values and then calculate average.
rna_hugo_df = rna_hugo %>%
dplyr::filter(Gene %in% sigList$Genes)
rna_VanAllen_df = rna_VanAllen %>%
dplyr::filter(symbol %in% sigList$Genes) %>%
rename(Gene = symbol)
rna_Snyder_df = rna_Snyder %>%
dplyr::filter(gene_name %in% sigList$Genes) %>%
rename(Gene = gene_name)Combine them with sigList, then transform these data.frame to long format, calculate mean gene expression and transform back to wide format.
rna_hugo_sigExpr = rna_hugo_df %>%
dplyr::right_join(sigList, by=c("Gene"="Genes")) %>%
tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
dplyr::group_by(sample, Signature) %>%
dplyr::summarise(expression = mean(expression)) %>%
tidyr::spread(key=Signature, value=expression) %>%
dplyr::mutate(study = "Hugo2016")
rna_VanAllen_sigExpr = rna_VanAllen_df %>%
dplyr::right_join(sigList, by=c("Gene"="Genes")) %>%
tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
dplyr::group_by(sample, Signature) %>%
dplyr::summarise(expression = mean(expression)) %>%
tidyr::spread(key=Signature, value=expression) %>%
dplyr::mutate(study = "VanAllen2015")
rna_Snyder_sigExpr = rna_Snyder_df %>%
dplyr::right_join(sigList, by=c("Gene"="Genes")) %>%
tidyr::gather(key = "sample", value = "expression", -Gene, -Signature) %>%
dplyr::group_by(sample, Signature) %>%
dplyr::summarise(expression = mean(expression)) %>%
tidyr::spread(key=Signature, value=expression) %>%
dplyr::mutate(study = "Snyder2017")Show results as tables.
DT::datatable(rna_hugo_sigExpr,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
caption = "gene signature expression in Hugo 2016 dataset")
DT::datatable(rna_VanAllen_sigExpr,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
caption = "gene signature expression in VanAllen 2015 dataset")
DT::datatable(rna_Snyder_sigExpr,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
caption = "gene signature expression in Snyder 2017 dataset")Lastly, merge the three result datasets for downstream analysis.
Input file generation for TIDE
TIDE is developed by Shirley Liu lab, software website is available at (http://tide.dfci.harvard.edu/). We can generate input file for TIDE and submit it to the website for TIDE score calculation.
Following suggestions provided at TIDE website, we firstly generate input files for three immunotherapy datasets .
rna_Snyder = rna_Snyder %>%
rename(Gene = gene_name)
rna_VanAllen = rna_VanAllen %>%
rename(Gene = symbol)
# normalized expression substract mean expression
generateInput = function(df) {
df %>%
mutate(Ref = rowMeans(.[, -1])) %>%
mutate_at(vars(-Gene, -Ref), `-`, quote(Ref)) %>% #f <- function(num, ref) num - !!ref
select(-Ref) %>%
column_to_rownames(var = "Gene") %>%
as.data.frame()
}
TIDE_hugo = generateInput(rna_hugo)
#> Warning: Setting row names on a tibble is deprecated.
TIDE_VanAllen = generateInput(rna_VanAllen)
#> Warning: Setting row names on a tibble is deprecated.
TIDE_Snyder = generateInput(rna_Snyder)
#> Warning: Setting row names on a tibble is deprecated.
if (!file.exists("results/TIDE_input_hugo.tsv")){
write.table(TIDE_hugo, file="results/TIDE_input_hugo.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
write.table(TIDE_VanAllen, file="results/TIDE_input_VanAllen.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
write.table(TIDE_Snyder, file="results/TIDE_input_Snyder.tsv", quote = FALSE, sep = "\t", row.names = TRUE, col.names = TRUE)
}
rm(list = ls())Next we submit the three tsv file to TIDE website. Previous immunotherapy item is set as “No”. The result csv files are saved to report/results directory.
ROC comparison
This section we compare ROC (AUC value) of biomarkers for response prediction.
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
library(tidyverse)
#theme_set(theme_bw())
#library(cowplot)
#------------------------------------------
# Load data
load("results/Hugo2016_Info.RData")
load("results/VanAllen2015_Info.RData")
load("results/Snyder2017_Info.RData")
load("results/GEBiomarker.RData")
# Of note, output csv from TIDE website has extra "" in last line
# please remove it before load data
TIDE_hugo = read_csv("results/TIDE_output_hugo.csv", col_types = cols())
TIDE_Snyder = read_csv("results/TIDE_output_Snyder.csv", col_types = cols())
TIDE_VanAllen = read_csv("results/TIDE_output_VanAllen.csv", col_types = cols())
#-----------------------------------------
# Merge datasets
info_hugo = info_hugo %>%
dplyr::left_join(dplyr::filter(GEBiomarker, study=="Hugo2016") %>%
dplyr::select(-study), by=c("PatientID" = "sample")) %>%
dplyr::left_join(TIDE_hugo %>% dplyr::select(Patient, TIDE), by = c("PatientID" = "Patient"))
#------------------------------------------
# Hugo 2016
hugo2016_roc_TMB = roc(info_hugo$Response, info_hugo$nTMB)
hugo2016_roc_APS = roc(info_hugo$Response, info_hugo$APS)
hugo2016_roc_TIGS = roc(info_hugo$Response, info_hugo$TIGS)
hugo2016_roc_IIS = roc(info_hugo$Response, info_hugo$IIS)
hugo2016_roc_CD8 = roc(info_hugo$Response, info_hugo$CD8)
hugo2016_roc_IFNG = roc(info_hugo$Response, info_hugo$IFNG)
hugo2016_roc_PDL1 = roc(info_hugo$Response, info_hugo$PDL1)
hugo2016_roc_TIDE = roc(info_hugo$Response, info_hugo$TIDE)
ggroc(list(TMB=hugo2016_roc_TMB, TIGS=hugo2016_roc_TIGS, TIDE=hugo2016_roc_TIDE), legacy.axes = TRUE) +
labs(color = "Predictor") + ggpubr::theme_classic2(base_size = 12, base_family = "Arial")
# AUC of predictors
auc(hugo2016_roc_TMB) # AUC of TMB
#> Area under the curve: 0.5994
auc(hugo2016_roc_APS) # AUC of APS
#> Area under the curve: 0.6978
auc(hugo2016_roc_TIGS) # AUC of TIGS
#> Area under the curve: 0.7692
auc(hugo2016_roc_IIS) # AUC of IIS
#> Area under the curve: 0.6154
auc(hugo2016_roc_CD8) # AUC of CD8
#> Area under the curve: 0.5055
auc(hugo2016_roc_IFNG) # AUC of IFNG
#> Area under the curve: 0.4945
auc(hugo2016_roc_PDL1) # AUC of PDL1
#> Area under the curve: 0.4286
auc(hugo2016_roc_TIDE) # AUC of TIDE
#> Area under the curve: 0.783
#---------------------------
# VanAllen 2015
info_VanAllen = info_VanAllen %>%
dplyr::left_join(dplyr::filter(GEBiomarker, study=="VanAllen2015") %>%
dplyr::select(-study), by=c("patient" = "sample")) %>%
dplyr::left_join(TIDE_VanAllen %>% dplyr::select(Patient, TIDE), by = c("patient" = "Patient"))
info_VanAllen2 = info_VanAllen %>%
mutate(Response = case_when(
group == "response" ~ "R",
group == "nonresponse" ~"NR",
TRUE ~ NA_character_
)) %>% filter(!is.na(Response))
## -- using following command if only compare patients with TIGS availble
## -- the result are basically same
# info_VanAllen2 = info_VanAllen %>%
# mutate(Response = case_when(
# group == "response" ~ "R",
# group == "nonresponse" ~"NR",
# TRUE ~ NA_character_
# )) %>% filter(!is.na(Response), !is.na(TIGS))
VanAllen2015_roc_TMB = roc(info_VanAllen2$Response, info_VanAllen2$nTMB)
VanAllen2015_roc_APS = roc(info_VanAllen2$Response, info_VanAllen2$APS)
VanAllen2015_roc_TIGS = roc(info_VanAllen2$Response, info_VanAllen2$TIGS)
VanAllen2015_roc_IIS = roc(info_VanAllen2$Response, info_VanAllen2$IIS)
VanAllen2015_roc_CD8 = roc(info_VanAllen2$Response, info_VanAllen2$CD8)
VanAllen2015_roc_IFNG = roc(info_VanAllen2$Response, info_VanAllen2$IFNG)
VanAllen2015_roc_PDL1 = roc(info_VanAllen2$Response, info_VanAllen2$PDL1)
VanAllen2015_roc_TIDE = roc(info_VanAllen2$Response, info_VanAllen2$TIDE)
ggroc(list(TMB=VanAllen2015_roc_TMB,TIGS=VanAllen2015_roc_TIGS, TIDE=VanAllen2015_roc_TIDE),
legacy.axes = TRUE) +
labs(color = "Predictor") +
ggpubr::theme_classic2(base_size = 12, base_family = "Arial")
# AUC of predictors
auc(VanAllen2015_roc_TMB) # AUC of TMB
#> Area under the curve: 0.6748
auc(VanAllen2015_roc_APS) # AUC of APS
#> Area under the curve: 0.6894
auc(VanAllen2015_roc_TIGS) # AUC of TIGS
#> Area under the curve: 0.7552
auc(VanAllen2015_roc_IIS) # AUC of IIS
#> Area under the curve: 0.6708
auc(VanAllen2015_roc_CD8) # AUC of CD8
#> Area under the curve: 0.6615
auc(VanAllen2015_roc_PDL1) # AUC of PDL1
#> Area under the curve: 0.6087
auc(VanAllen2015_roc_IFNG) # AUC of IFNG
#> Area under the curve: 0.6708
auc(VanAllen2015_roc_TIDE) # AUC of TIDE
#> Area under the curve: 0.7438
#---------------------------------------
# Snyder 2017
info_Snyder = info_Snyder %>%
dplyr::left_join(dplyr::filter(GEBiomarker, study=="Snyder2017") %>%
dplyr::select(-study), by=c("patient_id" = "sample")) %>%
dplyr::left_join(TIDE_Snyder %>% dplyr::select(Patient, TIDE), by = c("patient_id" = "Patient"))
Snyder2017_roc_TMB = roc(info_Snyder$is_benefit, info_Snyder$nTMB)
Snyder2017_roc_APS = roc(info_Snyder$is_benefit, info_Snyder$APS)
Snyder2017_roc_TIGS = roc(info_Snyder$is_benefit, info_Snyder$TIGS)
Snyder2017_roc_IIS = roc(info_Snyder$is_benefit, info_Snyder$IIS)
Snyder2017_roc_CD8 = roc(info_Snyder$is_benefit, info_Snyder$CD8)
Snyder2017_roc_PDL1 = roc(info_Snyder$is_benefit, info_Snyder$PDL1)
Snyder2017_roc_IFNG = roc(info_Snyder$is_benefit, info_Snyder$IFNG)
Snyder2017_roc_TIDE = roc(info_Snyder$is_benefit, info_Snyder$TIDE)
ggroc(list(TMB=Snyder2017_roc_TMB, TIGS=Snyder2017_roc_TIGS, TIDE=Snyder2017_roc_TIDE),
legacy.axes = TRUE) +
labs(color = "Predictor") + ggpubr::theme_classic2(base_size = 12, base_family = "Arial")
# AUC of predictors
auc(Snyder2017_roc_TMB) # AUC of TMB
#> Area under the curve: 0.6923
auc(Snyder2017_roc_APS) # AUC of APS
#> Area under the curve: 0.6806
auc(Snyder2017_roc_TIGS) # AUC of TIGS
#> Area under the curve: 0.7607
auc(Snyder2017_roc_IIS) # AUC of IIS
#> Area under the curve: 0.4792
auc(Snyder2017_roc_CD8) # AUC of CD8
#> Area under the curve: 0.6736
auc(Snyder2017_roc_PDL1) # AUC of PDL1
#> Area under the curve: 0.5833
auc(Snyder2017_roc_IFNG) # AUC of IFNG
#> Area under the curve: 0.6042
auc(Snyder2017_roc_TIDE) # AUC of TIDE
#> Area under the curve: 0.7014From comparison of AUC value above, we can conclude that:
- TIGS and TIDE are better than TMB
- TIGS and TIDE have almost same performance in two melanoma datasets, however, performance of TIDE go down a little in urothelial cancer dataset, maybe resulting from the fact that TIDE is trained on data of melanoma and NSCLC cancer (this point is stated at TIDE website, cancer type except for melanoma and NSCLC may not work).
- APS, TMB are also not bad predictors (they have almost same performance)
- Other predictors are not good predictors, they perform differently in different datasets.
Next, we summary results above for final visualization.
# summary AUC
auc_df = tibble(
Group = c("Hugo2016", "VanAllen2015", "Snyder2017"),
TIGS = c(as.numeric(auc(hugo2016_roc_TIGS)),
as.numeric(auc(VanAllen2015_roc_TIGS)),
as.numeric(auc(Snyder2017_roc_TIGS))),
TIDE = c(as.numeric(auc(hugo2016_roc_TIDE)),
as.numeric(auc(VanAllen2015_roc_TIDE)),
as.numeric(auc(Snyder2017_roc_TIDE))),
TMB = c(as.numeric(auc(hugo2016_roc_TMB)),
as.numeric(auc(VanAllen2015_roc_TMB)),
as.numeric(auc(Snyder2017_roc_TMB))),
APS = c(as.numeric(auc(hugo2016_roc_APS)),
as.numeric(auc(VanAllen2015_roc_APS)),
as.numeric(auc(Snyder2017_roc_APS))),
IIS = c(as.numeric(auc(hugo2016_roc_IIS)),
as.numeric(auc(VanAllen2015_roc_IIS)),
as.numeric(auc(Snyder2017_roc_IIS))),
IFNG = c(as.numeric(auc(hugo2016_roc_IFNG)),
as.numeric(auc(VanAllen2015_roc_IFNG)),
as.numeric(auc(Snyder2017_roc_IFNG))),
PDL1 = c(as.numeric(auc(hugo2016_roc_PDL1)),
as.numeric(auc(VanAllen2015_roc_PDL1)),
as.numeric(auc(Snyder2017_roc_PDL1))),
CD8 = c(as.numeric(auc(hugo2016_roc_CD8)),
as.numeric(auc(VanAllen2015_roc_CD8)),
as.numeric(auc(Snyder2017_roc_CD8)))
)
# Show AUC table
DT::datatable(auc_df,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
caption = "AUC of biomarkers for response prediction in three ICB datasets")# summary ROC object
roc_df = tibble(
Group = c("Hugo2016", "VanAllen2015", "Snyder2017"),
TIGS = c(list(hugo2016_roc_TIGS),
list(VanAllen2015_roc_TIGS),
list(Snyder2017_roc_TIGS)),
TIDE = c(list(hugo2016_roc_TIDE),
list(VanAllen2015_roc_TIDE),
list(Snyder2017_roc_TIDE)),
TMB = c(list(hugo2016_roc_TMB),
list(VanAllen2015_roc_TMB),
list(Snyder2017_roc_TMB)),
APS = c(list(hugo2016_roc_APS),
list(VanAllen2015_roc_APS),
list(Snyder2017_roc_APS)),
IIS = c(list(hugo2016_roc_IIS),
list(VanAllen2015_roc_IIS),
list(Snyder2017_roc_IIS)),
IFNG = c(list(hugo2016_roc_IFNG),
list(VanAllen2015_roc_IFNG),
list(Snyder2017_roc_IFNG)),
PDL1 = c(list(hugo2016_roc_PDL1),
list(VanAllen2015_roc_PDL1),
list(Snyder2017_roc_PDL1)),
CD8 = c(list(hugo2016_roc_CD8),
list(VanAllen2015_roc_CD8),
list(Snyder2017_roc_CD8))
)
if (!file.exists("results/roc_comparison_result.RData")) {
save(auc_df, roc_df, file = "results/roc_comparison_result.RData")
}Survival comparison
Above we already show APS, TMB, TIGS and TIDE are good predictors, this section we focus on them. We use cox model and separate patients into High and Low group in K-M survival analysis based on median biomarker predicted value.
library(survival)
library(survminer)
#> Loading required package: ggpubr
#> Loading required package: magrittr
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
#----------------------
# Hugo 2016
hugo_os = read_csv("../data/cell2016_OS.csv", col_types = cols())
info_hugo = full_join(info_hugo, hugo_os, by="PatientID")
# Cox model
# Cox result of TMB
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~
#> nTMB, data = info_hugo)
#>
#> coef exp(coef) se(coef) z p
#> nTMB -0.03045 0.97001 0.02179 -1.397 0.162
#>
#> Likelihood ratio test=3.05 on 1 df, p=0.08059
#> n= 37, number of events= 19
#> (1 observation deleted due to missingness)
# Cox result of APS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~
#> APS, data = info_hugo)
#>
#> coef exp(coef) se(coef) z p
#> APS -1.9820 0.1378 1.4845 -1.335 0.182
#>
#> Likelihood ratio test=1.84 on 1 df, p=0.1755
#> n= 26, number of events= 12
#> (12 observations deleted due to missingness)
# Cox result of TIGS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~
#> TIGS, data = info_hugo)
#>
#> coef exp(coef) se(coef) z p
#> TIGS -1.2140 0.2970 0.5536 -2.193 0.0283
#>
#> Likelihood ratio test=5.9 on 1 df, p=0.01515
#> n= 26, number of events= 12
#> (12 observations deleted due to missingness)
# Cox result of TIDE
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~
#> TIDE, data = info_hugo)
#>
#> coef exp(coef) se(coef) z p
#> TIDE 0.1668 1.1815 0.3086 0.54 0.589
#>
#> Likelihood ratio test=0.29 on 1 df, p=0.5919
#> n= 26, number of events= 12
#> (12 observations deleted due to missingness)
# K-M fit
info_hugo %>% # take care of NA values
mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High",
ifelse(is.na(APS), NA, "Low")),
TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High",
ifelse(is.na(TIDE), NA, "Low")),
TMB_Status = ifelse(nTMB>median(nTMB), "High", "Low"),
TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
ifelse(is.na(TIGS), NA, "Low"))) -> info_hugo2
# effect of TMB status on survival
fit1 = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
ggsurvplot(fit1, data = info_hugo2, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIGS status on survival
fit2 = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS_Status, data = info_hugo2)
ggsurvplot(fit2, data = info_hugo2, pval = TRUE, fun="pct", xlab = "Time (in days)")
# effect of APS status on survival
fit3 = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS_Status, data = info_hugo2)
ggsurvplot(fit3, data = info_hugo2, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIDE status on survival
fit4 = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE_Status, data = info_hugo2)
ggsurvplot(fit4, data = info_hugo2, pval = TRUE, fun="pct", xlab = "Time (in days)")
#----------------------
# Van Allen 2015
# Cox model
# Cox result of TMB
coxph(Surv(overall_survival, dead== 1) ~ nTMB, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#>
#> coef exp(coef) se(coef) z p
#> nTMB -0.007145 0.992880 0.005635 -1.268 0.205
#>
#> Likelihood ratio test=1.94 on 1 df, p=0.1632
#> n= 110, number of events= 83
#> (2 observations deleted due to missingness)
# Cox result of APS
coxph(Surv(overall_survival, dead== 1) ~ APS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#>
#> coef exp(coef) se(coef) z p
#> APS -1.6820 0.1860 0.7618 -2.208 0.0273
#>
#> Likelihood ratio test=5.23 on 1 df, p=0.0222
#> n= 42, number of events= 29
#> (70 observations deleted due to missingness)
# Cox result of TIGS
coxph(Surv(overall_survival, dead== 1) ~ TIGS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#>
#> coef exp(coef) se(coef) z p
#> TIGS -0.5135 0.5984 0.2676 -1.919 0.055
#>
#> Likelihood ratio test=4.68 on 1 df, p=0.03058
#> n= 40, number of events= 27
#> (72 observations deleted due to missingness)
# Cox result of TIDE
coxph(Surv(overall_survival, dead== 1) ~ TIDE, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#>
#> coef exp(coef) se(coef) z p
#> TIDE 0.5358 1.7088 0.1813 2.955 0.00313
#>
#> Likelihood ratio test=8.22 on 1 df, p=0.004134
#> n= 42, number of events= 29
#> (70 observations deleted due to missingness)
info_VanAllen %>%
filter(group != "long-survival") %>%
mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High",
ifelse(is.na(APS), NA, "Low")),
TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High",
ifelse(is.na(TIDE), NA, "Low")),
TMB_Status = ifelse(nTMB>median(nTMB, na.rm = TRUE), "High",
ifelse(is.na(nTMB), NA, "Low")),
TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
ifelse(is.na(TIGS), NA, "Low"))) -> info_VanAllen3
# effect of TMB status on survival
fit1 = survfit(Surv(overall_survival, dead== 1) ~ TMB_Status, data = info_VanAllen3)
ggsurvplot(fit1, data = info_VanAllen3, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIGS status on survival
fit2 = survfit(Surv(overall_survival, dead== 1) ~ TIGS_Status, data = info_VanAllen3)
ggsurvplot(fit2, data = info_VanAllen3, pval = TRUE, fun="pct", xlab = "Time (in days)")
# effect of APS status on survival
fit3 = survfit(Surv(overall_survival, dead== 1) ~ APS_Status, data = info_VanAllen3)
ggsurvplot(fit3, data = info_VanAllen3, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIDE status on survival
fit4 = survfit(Surv(overall_survival, dead== 1) ~ TIDE_Status, data = info_VanAllen3)
ggsurvplot(fit4, data = info_VanAllen3, pval = TRUE, fun="pct", xlab = "Time (in days)")
#----------------------
# Snyder 2017
# Cox model
# Cox result of TMB
coxph(Surv(os, event) ~ nTMB, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ nTMB, data = info_Snyder)
#>
#> coef exp(coef) se(coef) z p
#> nTMB -0.02115 0.97907 0.02151 -0.983 0.325
#>
#> Likelihood ratio test=1.1 on 1 df, p=0.2946
#> n= 22, number of events= 14
#> (6 observations deleted due to missingness)
# Cox result of APS
coxph(Surv(os, event) ~ APS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ APS, data = info_Snyder)
#>
#> coef exp(coef) se(coef) z p
#> APS -1.010 0.364 1.104 -0.915 0.36
#>
#> Likelihood ratio test=0.85 on 1 df, p=0.3576
#> n= 25, number of events= 17
#> (3 observations deleted due to missingness)
# Cox result of TIGS
coxph(Surv(os, event) ~ TIGS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIGS, data = info_Snyder)
#>
#> coef exp(coef) se(coef) z p
#> TIGS -0.4863 0.6149 0.3450 -1.41 0.159
#>
#> Likelihood ratio test=2.39 on 1 df, p=0.1219
#> n= 22, number of events= 14
#> (6 observations deleted due to missingness)
# Cox result of TIDE
coxph(Surv(os, event) ~ TIDE, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIDE, data = info_Snyder)
#>
#> coef exp(coef) se(coef) z p
#> TIDE -0.3926 0.6753 0.3332 -1.178 0.239
#>
#> Likelihood ratio test=1.44 on 1 df, p=0.2301
#> n= 25, number of events= 17
#> (3 observations deleted due to missingness)
info_Snyder %>%
mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High",
ifelse(is.na(APS), NA, "Low")),
TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High",
ifelse(is.na(TIDE), NA, "Low")),
TMB_Status = ifelse(nTMB>median(nTMB, na.rm = TRUE), "High",
ifelse(is.na(nTMB), NA, "Low")),
TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
ifelse(is.na(TIGS), NA, "Low"))) -> info_Snyder2
# effect of TMB status on survival
fit1 = survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
ggsurvplot(fit1, data = info_Snyder2, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIGS status on survival
fit2 = survfit(Surv(os, event) ~ TIGS_Status, data = info_Snyder2)
ggsurvplot(fit2, data = info_Snyder2, pval = TRUE, fun="pct", xlab = "Time (in days)")
# effect of APS status on survival
fit3 = survfit(Surv(os, event) ~ APS_Status, data = info_Snyder2)
ggsurvplot(fit3, data = info_Snyder2, pval = TRUE, fun = "pct",
xlab = "Time (in days)")
# effect of TIDE status on survival
fit4 = survfit(Surv(os, event) ~ TIDE_Status, data = info_Snyder2)
ggsurvplot(fit4, data = info_Snyder2, pval = TRUE, fun="pct", xlab = "Time (in days)")From comparison of cox results and K-M plots above, we can conclude that:
- TIGS is very robust, in all three datasets, “High group” all show statistically significant better survival than “Low group”.
- TIDE performs just so-so in Hugo 2016 dataset, very good in Van Allen 2015 dataset. But, it losts its power in Snyder 2017 dataset, patients with low TIDE value should have better survival but in this dataset patients with high TIDE value have better survival.
- Patients with High TMB or APS in three datasets all show trends of better survival but not statistically significant.
Next, we summary results (cox HR value and K-M fit) above for final visualization.
# summary cox result
cox_TMB = list()
cox_APS = list()
cox_TIGS = list()
cox_TIDE = list()
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)) -> cox_TMB[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)) -> cox_APS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)) -> cox_TIGS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)) -> cox_TIDE[[1]]
summary(coxph(Surv(overall_survival, dead== 1) ~ nTMB, data = info_VanAllen)) -> cox_TMB[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ APS, data = info_VanAllen)) -> cox_APS[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ TIGS, data = info_VanAllen)) -> cox_TIGS[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ TIDE, data = info_VanAllen)) -> cox_TIDE[[2]]
summary(coxph(Surv(os, event) ~ nTMB, data = info_Snyder)) -> cox_TMB[[3]]
summary(coxph(Surv(os, event) ~ APS, data = info_Snyder)) -> cox_APS[[3]]
summary(coxph(Surv(os, event) ~ TIGS, data = info_Snyder)) -> cox_TIGS[[3]]
summary(coxph(Surv(os, event) ~ TIDE, data = info_Snyder)) -> cox_TIDE[[3]]
cox_df = tibble(
Group = c(rep("Hugo2016",4), rep("VanAllen2015",4), rep("Snyder2017",4)),
Biomarker = rep(c("TMB", "APS", "TIGS", "TIDE"), 3),
HR = c(
as.numeric(cox_TMB[[1]]$conf.int[1]),
as.numeric(cox_APS[[1]]$conf.int[1]),
as.numeric(cox_TIGS[[1]]$conf.int[1]),
as.numeric(cox_TIDE[[1]]$conf.int[1]),
as.numeric(cox_TMB[[2]]$conf.int[1]),
as.numeric(cox_APS[[2]]$conf.int[1]),
as.numeric(cox_TIGS[[2]]$conf.int[1]),
as.numeric(cox_TIDE[[2]]$conf.int[1]),
as.numeric(cox_TMB[[3]]$conf.int[1]),
as.numeric(cox_APS[[3]]$conf.int[1]),
as.numeric(cox_TIGS[[3]]$conf.int[1]),
as.numeric(cox_TIDE[[3]]$conf.int[1])
),
Pvalue = c(
as.numeric(cox_TMB[[1]]$logtest["pvalue"]),
as.numeric(cox_APS[[1]]$logtest["pvalue"]),
as.numeric(cox_TIGS[[1]]$logtest["pvalue"]),
as.numeric(cox_TIDE[[1]]$logtest["pvalue"]),
as.numeric(cox_TMB[[2]]$logtest["pvalue"]),
as.numeric(cox_APS[[2]]$logtest["pvalue"]),
as.numeric(cox_TIGS[[2]]$logtest["pvalue"]),
as.numeric(cox_TIDE[[2]]$logtest["pvalue"]),
as.numeric(cox_TMB[[3]]$logtest["pvalue"]),
as.numeric(cox_APS[[3]]$logtest["pvalue"]),
as.numeric(cox_TIGS[[3]]$logtest["pvalue"]),
as.numeric(cox_TIDE[[3]]$logtest["pvalue"])
)
)
# show summary data as table
DT::datatable(cox_df,
options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE,
caption = "HR of biomarkers for response prediction in three ICB datasets")
# summary K-M fit
km_TMB = list()
km_APS = list()
km_TIGS = list()
km_TIDE = list()
km_TMB[["Hugo"]] = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
km_TIGS[["Hugo"]] = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS_Status, data = info_hugo2)
km_APS[["Hugo"]] = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS_Status, data = info_hugo2)
km_TIDE[["Hugo"]] = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE_Status, data = info_hugo2)
km_TMB[["VanAllen"]] = survfit(Surv(overall_survival, dead== 1) ~ TMB_Status, data = info_VanAllen3)
km_TIGS[["VanAllen"]] = survfit(Surv(overall_survival, dead== 1) ~ TIGS_Status, data = info_VanAllen3)
km_APS[["VanAllen"]] = survfit(Surv(overall_survival, dead== 1) ~ APS_Status, data = info_VanAllen3)
km_TIDE[["VanAllen"]] = survfit(Surv(overall_survival, dead== 1) ~ TIDE_Status, data = info_VanAllen3)
km_TMB[["Snyder"]] = survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
km_TIGS[["Snyder"]] = survfit(Surv(os, event) ~ TIGS_Status, data = info_Snyder2)
km_APS[["Snyder"]] = survfit(Surv(os, event) ~ APS_Status, data = info_Snyder2)
km_TIDE[["Snyder"]] = survfit(Surv(os, event) ~ TIDE_Status, data = info_Snyder2)
km_fit = list(TMB=km_TMB, APS=km_APS, TIGS=km_TIGS, TIDE=km_TIDE)
if (!file.exists("results/cox_km_result.RData")) {
save(cox_df, km_fit, file = "results/cox_km_result.RData")
}Visualization of comparison profile
This section we visualize comparsion profile, integrate analysis results of 3 immunotherapy datasets into a big figure.
# make sure these package are loaded
library(pROC)
library(ggsci)
library(ggplot2)
suppressPackageStartupMessages(library(cowplot))Integrate ROC curve.
theme_set(cowplot::theme_cowplot())
##-------- ROC
# Van Allen
p_roc_1 = ggroc(list(TMB=roc_df$TMB[[2]], TIDE=roc_df$TIDE[[2]], TIGS=roc_df$TIGS[[2]]), legacy.axes = TRUE) +
labs(color = "") + scale_color_manual(values = c("blue", "red", "black")) +
theme(legend.position = c(0.8,0.4))
# Hugo
p_roc_2 = ggroc(list(TMB=roc_df$TMB[[1]], TIGS=roc_df$TIGS[[1]], TIDE=roc_df$TIDE[[1]]), legacy.axes = TRUE) +
labs(color = "") + scale_color_manual(values = c("blue", "red", "black")) +
theme(legend.position = c(0.8,0.4))
# Snyder
p_roc_3 = ggroc(list(TMB=roc_df$TMB[[3]], TIGS=roc_df$TIGS[[3]], TIDE=roc_df$TIDE[[3]]), legacy.axes = TRUE) +
labs(color = "") + scale_color_manual(values = c("blue", "red", "black")) +
theme(legend.position = c(0.8,0.4))
plot_grid(p_roc_1, p_roc_2, p_roc_3, nrow = 3)##-------- AUC values
auc_df_long = auc_df %>%
tidyr::gather(key = "biomarker", value = "auc", -Group)
#p_auc_1 =
auc_df_long$auc = round(auc_df_long$auc, digits = 3)
auc_df_long %>%
filter(Group == "VanAllen2015") %>%
ggplot(aes(x=biomarker, y=auc)) +
geom_bar(stat = "identity", fill="lightblue") +
geom_text(aes(label=auc), hjust=1) +
geom_hline(yintercept = 0.5, linetype=2) +
ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
coord_flip() -> p_auc_1
auc_df_long %>%
filter(Group == "Hugo2016") %>%
ggplot(aes(x=biomarker, y=auc)) +
geom_bar(stat = "identity", fill="lightblue") +
geom_text(aes(label=auc), hjust=1) +
geom_hline(yintercept = 0.5, linetype=2) +
ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
coord_flip() -> p_auc_2
auc_df_long %>%
filter(Group == "Snyder2017") %>%
ggplot(aes(x=biomarker, y=auc)) +
geom_bar(stat = "identity", fill="lightblue") +
geom_text(aes(label=auc), hjust=1) +
geom_hline(yintercept = 0.5, linetype=2) +
ylim(0, 0.8) + ylab("AUC") + theme(axis.title.y = element_blank()) +
coord_flip() -> p_auc_3#--- KM-plots
p_surv_APS = list()
p_surv_TMB = list()
p_surv_TIGS = list()
p_surv_TIDE = list()
p_surv_APS[[1]] = ggsurvplot(km_APS$VanAllen, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))
p_surv_APS[[2]] = ggsurvplot(km_APS$Hugo, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))
p_surv_APS[[3]] = ggsurvplot(km_APS$Snyder, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))
p_surv_TMB[[1]] = ggsurvplot(km_TMB$VanAllen, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))
p_surv_TMB[[2]] = ggsurvplot(km_TMB$Hugo, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))
p_surv_TMB[[3]] = ggsurvplot(km_TMB$Snyder, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))
p_surv_TIGS[[1]] = ggsurvplot(km_TIGS$VanAllen, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))
p_surv_TIGS[[2]] = ggsurvplot(km_TIGS$Hugo, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.4), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))
p_surv_TIGS[[3]] = ggsurvplot(km_TIGS$Snyder, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))
p_surv_TIDE[[1]] = ggsurvplot(km_TIDE$VanAllen, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
p_surv_TIDE[[2]] = ggsurvplot(km_TIDE$Hugo, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
p_surv_TIDE[[3]] = ggsurvplot(km_TIDE$Snyder, pval = TRUE, fun = "pct",
xlab = "Time (in days)", palette = c("red", "blue"),
legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
#ggpar(p , font.legend = c(12, "bold", "black"))plot_list = list(
p_roc_1, p_auc_1, p_surv_TMB[[1]]$plot, p_surv_TIGS[[1]]$plot, p_surv_TIDE[[1]]$plot,
p_roc_2, p_auc_2, p_surv_TMB[[2]]$plot, p_surv_TIGS[[2]]$plot, p_surv_TIDE[[2]]$plot,
p_roc_3, p_auc_3, p_surv_TMB[[3]]$plot, p_surv_TIGS[[3]]$plot, p_surv_TIDE[[3]]$plot
)
# plot_grid(p_roc_1, p_roc_2, p_roc_3,
# p_auc_1, p_auc_2, p_auc_3,
# p_surv_TMB[[1]]$plot, p_surv_TMB[[2]]$plot, p_surv_TMB[[3]]$plot,
# p_surv_TIGS[[1]]$plot, p_surv_TIGS[[2]]$plot, p_surv_TIGS[[3]]$plot,
# p_surv_TIDE[[1]]$plot, p_surv_TIDE[[2]]$plot, p_surv_TIDE[[3]]$plot, nrow = 3, ncol=5,
# align = "vh")
plot_res = plot_grid(plotlist = plot_list, nrow=3, ncol=5,
labels = c(
"a", "d", "g", "j", "m",
"b", "e", "h", "k", "n",
"c", "f", "i", "l", "o"
))
save_plot("comp_profile.png", plot_res, nrow=3, ncol=5)
knitr::include_graphics("comp_profile.png", dpi = 300)